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Abstract 

Global  temperature  variations  between  1861  and  1984  are  forecast  using  regularization  network,  mul¬ 
tilayer  perceptrons,  linear  autoregression,  and  a  local  model  known  as  the  simplex  projection  method. 
The  simplex  projection  method  is  applied  to  characterize  complexities  in  the  time  series  in  terms  of  the 
dependence  of  prediction  accuracy  on  embedding  dimension  and  on  prediction-time  interval.  Nonlinear 
forecasts  from  the  library  patterns  between  1861  and  1909  reveal  that  prediction  accuracies  are  optimal 
at  the  embedding  dimension  of  4  and  deteriorate  with  prediction-time  interval.  Regularization  network, 
backpropagation,  and  linear  autoregression  are  applied  to  make  short  term  predictions  of  the  meteorolog¬ 
ical  time  series  from  1910  to  1984.  The  regularization  network,  optimized  by  stochastic  gradient  descent 
associated  with  colored  noise,  gives  the  best  forecasts.  For  all  the  models,  prediction  errors  noticeably 
increase  after  1965.  These  results  are  consistent  with  the  hypothesis  that  the  climate  dynamics  is  charac¬ 
terized  by  low-dimensional  chaos  and  that  the  it  may  have  changed  at  some  point  after  1965,  which  is  also 
consistent  with  the  recent  idea  of  climate  change.  However,  care  must  be  taken  of  such  an  interpretation 
in  that  a  time  series  of  colored  noise  with  few  data  points  that  has  zero  mean  and  many  degrees  of  freedom 
can  also  show  a  similar  behavior. 
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1  Introduction 


In  this  paper  wo  will  apply  some  linear  and  nonlinear 
regression  techniques  to  the  analysis  of  the  time  series  of 
global  temperatnre  variations  between  1861  and  1981. 
We  use  a  data  set  published  by  Jones  et  al.  (1986), 
who  synthesized  global  mean  surface  air  temperature 
differences  between  successive  years  by  correcting  non- 
climatic  factors  in  near-snrface  temperature  data  over 
the  land  and  the  oceans  of  both  hemispheres.  As  pointed 
out  by  Jones  et  al.,  this  time  series  has  the  interesting 
feature  of  having  a  long  timescale  warming  trend  that  is 
remarkable  in  the  1980s,  and  that  is  in  the  right  direction 
and  of  the  correct  magnitude  in  terms  of  recent  ideas  of 
global  warming  (see  figure  1).  It  has  been  recently  con¬ 
jectured  that  it  exists  a  global  climate  dynamical  system, 
that  is  chaotic  and  whose  attractor  has  a  low  dimension¬ 
ality  (4-7).  If  this  is  the  case  it  should  be  possible  to 
model  the  time  series  of  global  temperature  variations 
with  a  model  of  the  type 


x{t  +  l)  =  . a.(/-„))  +  ^,  (1) 

where  /  is  some  unknown  function,  the  time  f  is  ex¬ 
pressed  in  years,  n  is  a  small  number  (of  the  order  of 
the  dimension  of  the  chaotic  attractor)  called  (mh(d(ling 
dimension,  and  are  random  independent  variables  rep¬ 
resenting  uncertainty  in  the  measurements.  In  this  paper 
we  want,  to  investigate  how  well  a  model  of  type  (I)  can 
fit.  and  predict,  the  data,  and  will  use  different  techniqties 
to  reconstruct  the  unknown  function  /,  repre.senting  the 
dynamics  underlying  the  data. 

The  paper  is  organized  as  follows.  In  section  2  we 
introduce  the  general  problem  of  time  series  prediction, 
and  discuss  the  particular  case  of  the  global  temperature 
time  series,  pointing  out  where  the  major  difficulties  are. 
In  section  3  we  use  a  technique  by  Sugihara  and  May 
(1990)  to  estimate  the  minimal  embedding  dimension 
for  the  global  temperature  time  series  (the  number  v  in 
eq.  1)  and  to  test  the  hypothesis  of  chaotic  dynamics.  In 
section  4  we  present  some  experimental  results  obtained 
applying  regularization  networks,  linear  autoregression 
and  multilayer  perceptrons  to  the  prediction  of  this  time 
series.  In  section  5  we  discuss  the  results  and  describe 
some  future  work.  In  the  appendices  we  describe  the  dif¬ 
ferent  forecasting  techniques  that  we  used  for  our  anal¬ 
ysis. 

2  Basic  time  series  analysis 

Making  predictions  is  one  of  the  basic  subjects  in  science. 
Building  a  moded  for  forecasting  a  time  series  is  one  of 
the  tools  that  we  can  use  to  analyize  the  mechanism  that 
generated  the  data.  Given  the  time  series  {j-ffllfLi  of 
observations  of  the  variable  x  at  differents  points  in  time 
there  are  two  extreme  situations  that  we  can  happen  to 
encounter  in  its  analysis: 

1.  the  value  of  t  he  variabh'  x  at  time  t  -|-  r  is  uniquely 
determined  by  the  values  of  the  variabh'  at  certain 
times  in  the  ]iast.  For  exam]rle  a  relation  of  the 
type 


Figure  1:  The  global  temperature  variation  as  re[>orted 
by  Jones  et  al.  (1986). 


•'•{f  +  r)  =  f{x{i).x(f  -  r) . x{t  -  vt)) 

could  hold  for  some  integer  ii  and  some  function 
/.  In  this  case  the  system  is  fully  deterministic, 
and,  in  principle,  its  behaviour  is  predictable.  One 
could  derive  the  mapping  /  from  first  principles, 
provided  that  all  the  interactions  between  elements 
in  the  system  are  clearly  known,  or  reconstruct  its 
shape  from  the  data  {3'(/)}f'_[. 

2.  the  values  of  x{i)  are  independent  random  vari¬ 
ables.  so  that  the  past  values  of  x  do  not  influence 
at  all  its  future  values.  There  is  no  deterministic 
mechanism  underlyiiig  the  data,  and  prediction  is 
not  possible  at  all. 

In  most  practical  and  interesting  cases  one  has  to  deal 
with  time  series  with  properties  that  lie  in  between  these 
two  extremes.  For  e.xample,  one  could  have  a  series  of  ob¬ 
servations  of  a  variable  whose  time  evolution  is  governed 
by  a  deterministic  set  of  equations,  but  the  measurement 
are  affected  by  noise.  In  this  case  a  more  appropriate 
model  for  the  time  series  would  be: 

•(•(f  +  T")  =  /(-i-fO.-eff  -  r) . x(t  -  nr))  -f  ^  (2) 

where  {^, }  is  a  set  of  random  variables.  If  one  could 
exactly  model  the  function  /  the  state  of  the  system  at 
time  i  -f  r  could  be  predicted  within  an  accuracy  that 
depends  only  the  variance  of  the  random  variables  {6}. 
A  system  of  this  type  is  therefore  intrinsically  determin¬ 
istic  but  some  stochasticity  enters  at  the  measurement 
level . 

In  other  ca.ses,  as  for  the  time  series  generated  by  the 
noise  in  a  semiconductor  device,  the  system  could  be 


intrinsically  stochastic.  However,  since  the  observation 
are  correlated  in  time,  it  might  be  still  predictable  to 
some  extent. 

It  is  usually  very  difficult  to  discover,  from  a  time  se¬ 
ries,  what  kind  of  mechanism  generated  the  data,  but 
building  models  of  the  type  (1)  is  often  useful  to  under¬ 
stand  the  relevant  variables  of  the  system  and  its  de¬ 
gree  of  randomness.  Until  not  many  years  ago  most  of 
the  models  used  to  analyze  time  series  were  linear,  and 
therefore  limited  to  succeed  on  the  class  of  linear  prob¬ 
lems.  However,  recent  progress  in  theoretical  and  com¬ 
putational  aspects  in  nonlinear  problems  has  enabled  us 
to  characterize  the  degrees  of  randomness  in  nonlinear 
systems  and  to  forecast  nonlinear  dynamical  behavior.  A 
number  of  authors,  among  which  Farmer  and  Sidorowich 
(1987),  Casdagli  (1989),  Sugihara  and  May  (1990),  have 
applied  and  developed  novel  nonlinear  regression  tech¬ 
niques  to  predict  chaotic  time  series,  and  shown  that 
short-term  predictability  can  often  be  achieved  with  a 
good  degree  of  accuracy.  They  have  also  demonstrated 
that  low-dimensional  chaos  can  be  distinguished  from 
white  noise  according  to  nonlinear  forecasts  of  the  dy¬ 
namical  behavior. 

One  could  find  the  minimal  embedding  dimension  and 
valid  prediction-time  interval  through  applying  a  vari¬ 
ety  of  network  structures  by  trial  and  error  of  cross- 
validation  techniques.  Such  procedure,  however,  is  likely 
to  be  computationally  expensive,  and  one  of  the  ways  to 
circumvent  such  problem  is  to  resort  to  local  approxima¬ 
tion  such  as  the  algorithm  of  Sugihara  and  May  (1990). 
In  local  approximation,  predictions  are  made  from  only 
nearby  states  in  the  past,  so  that  it  consumes  less  com¬ 
putational  time.  The  local  approximation,  however,  has 
shortcomings  that  the  mapping  is  discontinuous,  which 
results  in  less  sufficient  prediction  accuracy  than  global 
approximation  such  as  neural  networks.  Thus  the  asso¬ 
ciation  of  the  local  approximation  with  neural  networks 
is  an  effective  way  for  building  a  model  to  make  predic¬ 
tions  of  complex  time  series.  One  could  find  the  min¬ 
imal  embedding  dimension  and  the  valid  prediction  in¬ 
terval  using  the  local  approximation  without  consuming 
a  lot  of  computational  time  and  forecast  the  time  series 
with  good  prediction  accuracy  using  neural  networks  the 
structures  of  which  are  determined  according  to  the  fore¬ 
casts  by  the  local  approximation. 

2.1  Global  temperature  time  series  and  the 
climate  attractor 

It  has  been  controversial  whether  the  climate  is  low¬ 
dimensional  chaos  or  not  (Nicolis  and  Nicolis,  1984; 
Grassberger,  1984;  Essex,  bookman  and  Nerenberg, 
1987;  Lorenz,  1991).  Nicolis  and  Nicolis  (1984)  first 
claimed  the  existence  of  a  low-dimensional  climate  at¬ 
tractor  in  terms  of  the  correlation  dimension  of  the 
time  series  synthesized  by  interpolations  from  an  isotope 
record  of  deep-sea  cores.  Grassberger  (1984),  however, 
argued  that  their  estimate  may  reflect  not  the  actual 
climatic  dynamics  but  the  artifact  due  to  the  interpola¬ 
tion.  Meanwhile,  Essex  et  al.  (1987)  published  a  calcu¬ 
lation  on  the  correlation  dimension  for  non-filtered  time 


series  of  daily  geopotential  observations.  They  agreed 
with  the  existence  of  such  climate  attractor.  Recently 
Lorenz  (1991)  expressed  doubts  on  the  interpretations 
of  the  previons  calculations.  He  claimed  hat  a  low¬ 
dimensional  attractor  is  unlikely  to  exist  for  the  global 
climate,  although  the  climatic  subsystems  could  be  low¬ 
dimensional  chaos. 

3  Is  there  a  global  climatic  attractor? 

In  this  section  we  first  discuss  the  plausibility  of  the 
hypothesis  that  there  is  low-dimensional  chaotical  sys¬ 
tem  underlying  the  global  temperature  time  series,  and 
then  present  some  results  about  its  short-term  predic¬ 
tions  with  a  number  of  methods.  We  use  a  data  set 
published  by  Jones  et  al.  (1986),  who  synthesized  global 
mean  surface  air  temperature  differences  between  suc¬ 
cessive  years  by  correcting  non-climatic  factors  in  near¬ 
surface  temperature  data  over  the  land  and  the  oceans  of 
both  hemispheres.  Although  some  questions  such  as  the 
influence  of  urbanization  effect  are  raised  about  the  sig¬ 
nificance  of  the  data  (Wu,  Newell  and  Hsuing,  1990)  we 
still  think  it  is  of  interest  to  see  what  kind  of  information 
can  be  extracted  from  this  data  set. 

3.1  Testing  for  chaos 

We  have  used  the  technique  proposed  by  Sugihara  and 
May  (1990)  to  understand  the  nature  of  the  time  se¬ 
ries  we  decided  to  analyze.  The  technique  consists  in 
studying  the  behaviour  of  the  correlation  coefficient  be¬ 
tween  the  prediction  and  the  target  as  a  function  of  how 
many  steps  in  the  future  we  are  trying  to  predict,  and 
of  the  embedding  dimension.  Sugihara  and  May  argued 
that  looking  at  the  rate  of  decrease  to  zero  of  the  cor¬ 
relation  coefficient  it  is  possible  to  distinguish  chaotic 
systems  from  non-chaotic  ones.  In  our  experiments,  fol¬ 
lowing  Sugihara  and  May,  predictions  have  been  done 
according  to  the  simplex  projection  technique  described 
in  appendix  (A. 4). 

Figures  (2)  and  (3)  show  the  Sugihara-May  technique 
applied  to  a  number  of  synthetic  time  series.  Figure  (2) 
shows  a  plot  of  the  correlation  coefficient  as  a  function  of 
embedding  dimension  for  predictions  of:  a)  white  noise 
(open  triangles  and  dashed  line);  b)  f~^  °-noise  (solid 
triangles  and  dotted  line);  c)  Henon  mapping  superim¬ 
posed  on  white  noise  (solid  squares  and  short-dashed 
line),  d)  sine  wave  superimposed  on  white  noise  (open 
circles  and  long-dashed  line).  The  white  random  noise 
has  been  synthesized  using  equation  (8)  in  appendix  (B) 
with  a  =  0.  In  Figure  (2),  predictions  have  been  made 
on  a  test  set  of  150  vectors  from  a  data  set  of  150  exam¬ 
ples. 

Figure  (3)  shows  a  plot  of  the  correlation  coefficient 
as  a  function  of  prediction-time  step  for  the  time  series 
shown  in  figure  (2).  In  figure  (3),  predictions  have  been 
done  assuming  an  embedding  dimension  n  =  3,  in  similar 
way  to  Figure  (2). 

The  results  in  figure  (2)  and  figure  (3)  reveal  charac¬ 
teristics  of  complexities  in  each  time  series.  The  Henon 
mapping  superimposed  on  white  noise  has  a  peak  em¬ 
bedding  dimension  and  its  prediction  accuracy  deterio¬ 
rates  quickly  with  prediction-time  step.  This  is  typical  of 


low-dimensional  chaos.  The  sine  wave  supcrimpo.scd  on 
white  noise  has  long-term  predictability  indeirendently  of 
embedding  dimen.sion.  White  noise  is  unpredictable  at 
all.  It  should  be  noticed  that  the  ®-noise  has  short¬ 
term  predictability  independently  of  embedding  dimen¬ 
sion.  According  to  some  recent  work  (Miyano,  1994; 
Miyano  et  al.  1992),  such  characteristics  are  also  found 
for  an  actual  colored  noise  observed  for  a  semiconductor 
device.  The  flat  relation  between  the  correlation  coeffi¬ 
cient  and  embedding  dimension  is  an  important  indica¬ 
tor  in  distinguishing  low-dimensional  chaos  from  colored 
noise. 

We  applied  the  same  techniques  described  above  to  the 
time  series  observed  by  Jones  et  al.  (1986).  The  time 
series  has  been  obtained  by  hand-scanning  the  original 
figures  on  the  paper.  Therefore,  the  present  time  series 
includes  some  read  error.  We  use  the  first  45  input  vec¬ 
tors,  i.e.,  the  dat.a  from  1861  to  1915.  as  training  data, 
being  motivated  by  the  idea  that  a  climate  change  may 
have  occurred  around  the  mid  20th-century  by  factors 
such  as  increasing  energy  consumption  and  environmen¬ 
tal  pollution.  The  correlation  coefficient  as  a  function  of 
the  embedding  dimension  for  predictions  of  the  meteo¬ 
rological  time  series  is  presented  in  Figure  (4).  Forecasts 
up  to  1944  and  up  to  1984  arc  shown  by  solid  circles 
and  solid  line,  and  by  o]ien  circles  and  dashed  line,  re¬ 
spectively.  Notice  that  the  plot  has  a  peak  at  n  =  4, 
suggesting  that  this  time  series  has  been  generated  by 
a  dynamical  system  with  a  low  dimensional  attractor  of 
dimension  4. 

A  plot  the  correlation  coefficient  versus  the 
prediction-time  step  is  shown  in  Figure  (5),  where  the 
embedding  dimension  has  been  set  to  4,  according  to 
the  results  of  figure  (4).  Forecasts  from  1910  to  1944 
and  from  1910  to  1984  arc  indicated  by  .solid  circles  and 
solid  line,  and  by  open  circles  and  dashed  line,  respec¬ 
tively.  The  prediction  accuracy  deteriorates  rapidly  with 
increasing  time  step  in  both  cases.  This  indicates  that 
the  time  series  has  only  short-term  predictability.  The 
plots  shown  in  Figure  (4)  and  Figure  (5)  appear  to  be 
typical  of  a  low-dimensional  chaotic  time  series.  Such  di¬ 
agnosis,  however,  is  dangerous,  since  colored  noise  with 
many  degrees  of  freedom  can  also  provide  a  similar  trend 
in  prediction,  when  handling  relatively  small  number  of 
data  points. 

Figure  (6)  and  Figure  (7)  show  a  plot  of  the  corre¬ 
lation  coefficient  versus  the  embedding  dimension  and 
prediction-time  step  respectively  for  *  (Miyano. 
1994;  Miyano  et  al.  1992).  The  fractal  dimension 
was  estimated  using  the  algorithm  develojied  by  Higuchi 
(1988).  The  power  spectrum  of  t  he  random  noise  can  be 
described  as  ®  with  resjiect  to  frequency  /,  accord¬ 
ing  to  t  he  relation  between  the  fractal  dimension  D  and 
t.he  ])ower  law  index  n  of  D  =  (5  —  n)/2.  In  both 
Figure  (6)  and  Figure  (7),  solid  circles  and  solid  line  cor- 
res]iond  to  training  and  testing  set  size  of  50,  while  open 
circles  and  dashed  line  correspond  to  training  and  test 
size  of  500.  Not  ice  the  sensitivity  to  the  miinber  of  data. 


4  Forecasting  the  global  temperature 
time  series 

We  tested  three  different  approximation  techniques  to 
make  predictions  one  step  ahead,  with  an  embedding 
dimension  equal  to  4: 

1.  Gaussian  Hyper  Basis  Functions  with  a  linear  term. 
A  network  of  5  Gaussian  units  has  been  used,  to 
which  a  linear  term  has  been  added.  The  linear 
term  has  been  computed  first,  and  then  the  residu¬ 
als  have  been  approximated  with  the  gaussian  net¬ 
work,  Minimization  of  the  mean  square  error  was 
run  over  the  coefficients,  the  centers  and  the  vari¬ 
ances  of  the  gaussians,  for  a  total  of  30  free  param¬ 
eters. 

2.  Multilayer  Perceptron  with  one  hidden  layer.  A 
standard  Multilayer  Perceptron  network  was  used 
for  the  prediction,  with  4  hidden  sigmoidal  units, 
for  a  total  of  24  parameters. 

3.  linear  regression.  A  linear  regression  model  was  fit¬ 
ted  to  the  data,  using  the  statistical  package  .Spins 
(Becker,  Chambers  and  Wilks,  1988). 

W  e  use  the  first  45  data  points,  i.e.,  the  time  .series 
from  1861  to  1909,  as  training  set,  and  tested  the  re¬ 
sulting  approximation  on  three  different  test  sets:  1910- 
1944.  1910-1964,  1910  1984.  For  each  experiment  we 
measured  the  root  mean  square  error  s: 


where  the  sum  runs  over  the  elements  of  tin'  set  being 
tested,  Xa  are  test  values  and  x^  are  the  values  predicted 
by  the  model.  We  also  measured  for  each  test  set  the 
variance 


where  <  a'  >  is  the  average  value  of  the  test  set.  Notice 
that  the  quantity  ~  is  particularly  significant,  because 
if  it  has  value  zero  then  predictions  are  perfect,  while  if 
it  is  equal  to  1  then  predictions  are  no  better  that  the 
average  of  the  targets. 

The  experimental  results  for  r  and  (t  have  been  rei)orted 
in  fable  (1).  while  the  forecasts  are  shown  in  figures  (8), 
(9)  and  (10)  respectively.  In  the  upper  part  of  the  figures 
we  displayed  the  observed  time  series,  representixl  by  a 
dashed  line,  and  the  trained  model,  represented  by  the 
solid  line.  Solid  circles  have  been  used  for  the  test  set 
and  white  circles  for  the  training  set.  In  the  lower  part  of 
the  figures  the  residuals  of  the  approximation  have  been 
shown. 

It  is  clear  that  the  Hyper  Basis  Function  moch'l  maki's 
best  forecasts  and  that  the  linear  regression  makes  the 
worst.  This  suggest  that  the  dynamical  bi'havior  of  tin' 
time  series  is  nonlinear.  It  should  be  noticed,  however, 
that  prediction  error  increases  remarkably  afti'i'  1965  for 
all  the  models,  although  it  is  more  pronounci'd  in  tin' 


3 


Linear 

HBF 

MLP 

e  (training) 

0.10 

0.09 

0.90 

e  (1910-1944) 

0.14 

0.10 

0.12 

e  (1910-1964) 

0.14 

0.10 

0.13 

e  (1910-1984) 

0.16 

0.12 

0.15 

e/cr  (training) 

0.84 

0.83 

0.81 

e/(7  (1910-1944) 

1.13 

0.82 

1.01 

e/cr  (1910-1964) 

1.25 

0.88 

1.11 

e/cr  (1910-1984) 

1.31 

0.98 

1.18 

Table  1:  Root-mean-squared  prediction  error  (e)  and 
normalized  Root-mean-squared  prediction  error  (e/cr) 
for  the  3  technique  we  tested.  See  text  for  explanation 


Multilayer  Perceptron  and  linear  regression.  This  is  not 
inconsistent  with  the  idea  of  global  warming  suggested 
by  Jones  et  al.  In  fact,  assume  that  a  model  is  success¬ 
ful  in  learning  the  dynamics  underlying  the  meteorologi¬ 
cal  time  series  that  correspond  to  the  period  1861-1909. 
Then,  one  possible  interpretation  for  the  increase  in  the 
prediction  error  after  1965  is  that  a  change  in  the  climate 
dynamics  took  place  at  some  point  after  1965. 

Such  straightforward  interpretation  could  be,  how¬ 
ever,  a  misdiagnosis,  since  the  trend  in  prediction  error 
could  be  due  to  failure  in  generalizing  the  underlying  dy¬ 
namics.  In  fact,  according  to  our  recent  work  (Miyano, 
1994),  a  similar  trend  in  prediction  error  can  also  be 
present  in  forecasting  colored  noise  with  relatively  few 
data  points.  Figure  (11)  shows  the  ®-noise  observed 
for  a  semiconductor  device  (Miyano  et  al.  1992).  In  Fig¬ 
ure  (12),  we  present  results  of  learning  and  predictions 
of  the  random  noise  by  regularization  network  with  3 
input  nodes  and  5  hidden  nodes  without  linear  terms. 
Figure  (13)  shows  residuals  obtained  by  subtracting  the 
corresponding  target  from  each  prediction.  We  use  first 
250  points  in  the  training  set.  In  this  case,  the  opti¬ 
mal  embedding  dimension  is  assumed  to  be  3  accord¬ 
ing  to  the  results  shown  in  Figure  (7).  The  predictions 
agree  well  with  the  targets  except  for  the  portion  from 
time  =  300  to  350  during  which  the  network  forecasts 
lower  values  than  observed.  From  Figures  (12)  and  (13), 
one  would  make  a  misdiagnose  that  the  time  series  was 
low-dimensional  chaos  and  that  the  discrepancy  between 
time  =  300  and  350  indicated  some  change  in  the  dy¬ 
namics.  The  discrepancy  is,  however,  clearly  due  to  fail¬ 
ure  in  generalization. 

5  Discussion  and  open  problems 

The  present  considerations  on  the  meteorological  time 
series  leads  to  two  possible  interpretations  of  the  global 
climate:  one  is  that  a  low-dimensional  climate  attrac¬ 
tor  may  exist  and  that  the  climate  dynamics  may  have 
altered  at  some  point  after  1965;  the  other  is  that  the 


temperature  variations  may  be  colored  noise  with  many 
degrees  of  freedom.  The  latter  interpretation  would  lead 
to  the  following  forecast  of  the  future  trend  in  the  cli¬ 
mate:  the  global  temperature  would  begin  to  decrease  at 
some  point  in  the  future,  since  colored  noise  has  a  zero 
mean  in  a  long  time  period.  Within  the  framework  of 
the  present  study  we  can  dismiss  neither  of  the  interpre¬ 
tations.  The  present  work  is  still  in  a  preliminary  stage. 
In  a  future  paper,  we  plan  to  forecast  non-hand-scanned 
meteorological  time  series  with  more  data  points  and  to 
clarify  whether  the  climate  is  low-dimensional  chaos  or 
not. 
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A  Regression  techniques 

In  this  section  we  describe  the  different  regression  tech¬ 
niques  that  we  have  used  to  forecast  and  analyze  the 
global  temperature  time  series.  The  first  three  are  based 
on  the  assumption  that  the  data  can  be  well  approxi¬ 
mated  by  a  parametric  function  of  the  form 

n 

/(x)  =  ^  c„i7(x;Wa) 

a=l 

where  the  Ca  and  the  are  considered  free  parameters 
and  are  found  by  a  least  squares  technique.  Models  of 
this  type  are  usually  called  “neural  networks”,  because 
they  can  be  represented  by  a  network  with  one  layer 
of  hidden  units.  The  last  technique  is  a  local  technique, 
similar  in  spirit  to  nearest  neighbor  models  and  to  kernel 
regression. 

A.l  Linear  model 

The  simplest  model  of  the  form  (1)  that  we  can  build  is 
one  in  which  the  function  /  is  linear,  and  therefore  the 
relation  between  the  past  values  of  x  and  the  future  ones 
is  of  the  type: 

x{t)  =  aa-\-aix{t-l)+a2x{t—2)-\ - \-anx{t—n)+(,t  (3) 

This  is  the  so  called  AR  (autoregressive)  model,  and  is 
one  of  the  many  linear  models  that  have  been  developed 
in  time  series  analysis.  There  is  clearly  a  huge  litera¬ 
ture  about  linear  models  (see  for  example  Myers,  1986; 
Draper  and  Smith,  1981;  Searle,  1971)  and  we  will  not 
spend  more  time  on  this  topic  in  this  paper.  In  our  ex¬ 
periments  the  coefficients  of  the  model  have  been  com¬ 
puted  according  to  standard  least  square  routines,  that 
correspond  to  assuming  that  the  random  variables  in 
eq.  (3)  are  independent  and  have  Gaussian  distribution 
with  zero  mean.  The  statistical  package  Spins  (Becker, 
Chambers  and  Wilks,  1988)  has  been  used  to  perform 
this  calculation. 


A. 2  Gaussian  Hypor  Basis  Functions  with  a 

linear  term 

One  of  the  ]^arametiizations  we  used  for  the  prediction 
task  had  the  following  form: 

II 

/(x)  =  ^  +  a  •  X  +  f/  .  (4) 

0  =  1 

The  coefficients  Co,a,  cl,  the  centers  t^,  the  widths  Mq  of 
the  gaussians  were  considered  as  free  parameters.  Tech¬ 
niques  of  this  type  are  quite  common  in  the  neural  net¬ 
work  literature  (Moody  and  Darken,  1989;  Poggio  and 
Girosi,  1990),  although  the  linear  term  has  not  heed 
used  very  often.  When  the  widths  of  gaus.sians  have 
all  the  same  value  the  technique  is  a  particular  case  of 
is  a  particular  case  of  a  general  class  of  approximation 
techniques,  called  regularization  networks,  that  share  the 
property  that  they  can  be  justified  in  the  framework  of 
regularization  theory  (Girosi,  Jones  and  Poggio,  1993) 
and  can  be  represented  by  a  network  with  one  layer  of 
hidden  units. 

Usually  the  parameters  of  the  network  are  found  by 
minimizing  the  mean  square  error  by  some  numerical 
minimization  technique.  In  our  case,  since  the  number 
of  data  points  available  is  very  small,  we  do  not  expect  to 
bo  able  to  model  surfaces  much  more  complicated  than 
an  hyperplane.  Therefore,  we  first  fit  an  hyperplane  to 
the  data  (the  term  a  •  x  +  c/),  and  then  estimate  the 
parameters  of  the  rest  of  the  expansion  by  fitting  the 
residuals  of  the  liy]ierplane  approximation,  choosing  a 
small  number  n  of  basis  functions.  The  gaussian  basis 
functions  are  therefore  to  be  considered  as  a  ‘‘correc¬ 
tive  term”  to  a  linear  approximation  technique.  The 
minimization  technique  we  used  for  estimating  the  pa¬ 
rameters  of  the  gaussian  basis  functions  is  a  stochastic 
gradient  descent,  algorithm,  described  in  appendix  B. 

A. 3  Multilayer  perceptroiis 

Multilayer  perceptrons  (MLP)  are  an  approximation 
technique  that  is  based,  in  its  most  common  form,  on 
the  following  parametric  representation: 

11 

fix)  -  ■  W,'  -f  ^/)  , 

i  =  l 

where  a  is  a  sigmoidal  function,  that  in  our  case  we 
set  to  ,  /  _ ,  ,  that,  is  one  the  most  common  choices.  In 
our  computations  the  parameters  of  the  network  have 
been  estimated  using  backpropagation  and  the  general¬ 
ized  delta  rule  (Rumelhart,  Hinton  and  Williams.  1986; 
Hecht-Nielsen,  1989),  that  is  a  form  of  stochastic  gradi¬ 
ent  descent. 

A. 4  Simplex  projection  method 

The  simplex  projection  method,  that  has  been  used  by 
Sugihara  and  May  (1990)  to  tell  chaotic  time  series  from 
non  chaotic  ones,  is  a  local  approximation  method,  very 
close  to  the  1-nearest  neighbour  technique  and  kernel 
regression.  Sup]rose  that  a  data  set  of  N  data  points  in 
d  dimensions  has  been  given,  consisting  of  input-output 
pairs  {(x, ,;(/,  )}, ^1  fhat  have  been  obtained  by  randomly 


sampling  an  unknown  function  /  in  presence  of  noise. 
When  the  value  of  /  at  a  point  x  that  does  not  belong 
to  the  data  set  has  to  be  computed,  first  its  closest  f/ -f  1 
points  x,j, . .  in  the  data  set  ar('  found.  Then  the 

value  of  the  function  at  x  is  estimated  by  a  weighted 
average  of  the  values  of  the  function  at  the  data  i)oints 
x,j , . . .  Xij^j ,  that  is  the  values  y,-, , . . .  .  Points  that 

are  more  far  from  x  receive  a  smaller  weight,  according  to 
an  exponential  decay.  In  formulas,  the  estimated  value 
of  /  at  X  is 


/(x) 


where  we  have  defined 


E 


1 


E"', 


da  =  ||x-XiJ| 

and  cr  is  a  parameter  that  define  the  locality  of  the  tech¬ 
nique,  and  can  be  set  with  cross-validation  techniques. 
This  technique  can  be  considered  as  an  approximation 
of  kernel  regression  with  f  as  a  kernel.  In  fact  in  this 
case  kernel  regression  would  have  the  form: 


/(X) 


Ef.i?/T--llx-x.|l 

^f^^e-c||x-x,|| 


(5) 


The  method  we  use  in  this  paper  is  equivalent  to  take 
only  the  closest  d  -H  1  points  in  the  expansion  (5). 


B  Stochastic  gradient  descent 

Let  H(f)  be  a  function  that  has  to  be  minimized  with 
respect  to  the  set  of  parameters  f .  The  standard  gradi¬ 
ent  descent  algorithm  is  an  iterative  algorithm  in  which 
the  set  of  parameters  evolves  toward  the  minimnm  ac¬ 
cording  to  the  following  law: 


df  _  dHjg 

di  ""  df 


(6) 


where  /  is  a  time  parameter  in  the  iteration  looj)  of  the 
optimization  proce.ss  and  u;  >  0  is  called  learning  rate. 
One  of  the  many  inconveniences  of  this  algorithm  is  that, 
if  converges,  it  converges  to  a  local  minimum.  Since  in 
the  optimization  problems  arising  from  the  training  of  a 
neural  network  it  is  known  that  multiple  local  minima 
are  present,  it  is  important  to  have  a  techniciue  that  is 
able  to  escape  at  least  some  of  the  local  minima  and  get 
close  to  the  global  minimum.  A  sim[)l('  way  to  achi('ve 
this  consists  in  adding  a  stochastic  term  in  ec[.  (6),  that 
becomes: 


dHiO 

dt  ^  df 


+  nit) 


(7) 


where  ?;(/)  is  random  noise.  As  an  effect  of  the  addition 
of  the  noise  term  ?/  the  set  of  parameters  f  will  not  always 
follow  the  direction  of  the  gradient,  and  will  sometime 
go  uphill  instead  of  downhill,  having  a  chance  to  escape 
some  local  minima.  In  our  case  tlu'  random  noise  ?/(/) 
was  synthesized  by  the  following  equation: 


b 


X 

ri{t)  =  '^^(27rck)~^  [Acos(2irckt)  +  Bsin(2wckt)]  (8) 
fc=i 

where  c  and  A"  are  set  to  0.01  and  20000,  respectively, 
and  A  and  B  random  numbers  lying  between  0  and  1. 
The  power  spectrum  of  r](t)  is  given  as  with  respect 
to  the  frequency  /.  In  order  to  verify  the  validity  of 
the  algorithm,  we  adapt  it  to  optimizing  the  regulariza¬ 
tion  network  to  predict  a  chaotic  time  series  synthesized 
by  Henon  mapping  (embedding  dimension  is  set  to  2. 
White  noise  (i.e.,  a  =  0),  /“^-noise  (pink  noise),  and 
/“2-noise  are  used  as  perturbation.  It  is  found  that  /“*- 
noise  works  as  well  as  white  noise,  while  /“^-noise  does 
not.  The  value  of  the  normalized  root  mean  square  er¬ 
ror  ejcr  for  a  test  set  of  1000  points  and  a  training  set  of 
1000  examples  was  0.097  for  a  network  with  5  basis  func¬ 
tions  and  no  linear  term,  in  the  case  in  which  /“^ -noise 
on  1000  examples.  This  good  result  confirmed  that  the 
stochastic  gradient  descent  algorithm  worked  correctly 
and  achieved  some  good  local  minimum. 
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Figure  2:  Plot  of  tlic  correlation  coefficient  as  a  function 
of  emheclding  dimension  for  various  synthetic  time  series; 
w'hite  noise  (open  triangles  and  dashed  line);  *^-noise 
(solid  triangles  and  dotted  line);  Henon  mapping  super¬ 
imposed  on  white  noise  (solid  squares  and  short-dashed 
line);  and  sine  wave  superimposed  on  white  noise  (open 
circles  and  long-dashed  line).  Predictions  on  a  test  sol 
of  150  points  were  made  from  a  training  set  of  150  ex¬ 
amples. 


Figure  4;  Plot  of  the  correlation  coefficient  as  a  function 
of  embedding  dimension  for  the  meteorological  time  se¬ 
ries.  Predictions  are  made  from  the  first  45  data  points, 
i.e.,  the  data  from  1861  to  1915.  Forecasts  up  to  1944 
and  up  to  1984  are  shown  by  solid  circles  and  solid  line, 
and  by  open  circles  and  dashed  line,  respectively. 
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Figure  3;  Plot  of  the  correlation  coefficient  as  a  function 
of  prediction-time  steps  for  various  .synthetic  time  series; 
white  noise  (open  triangles  and  dashed  line);  '^-noise 
(solid  triangles  and  dotted  line);  Henon  mapping  super¬ 
imposed  on  white  noise  (solid  squares  and  short-dashed 
line);  and  sine  wave  su])erimpo.sed  on  white  noise  (open 
circles  and  long-dashed  line).  Predictions  on  a  test  set 
of  150  points  were  made  from  a  training  set  of  150  ex¬ 
amples.  The  embedding  dimension  is  set  to  3. 


Time  step 

Figure  5;  Plot  of  the  correlation  coefficient  as  a  function 
of  prediction-time  step  for  the  meteorological  time  series. 
Forecasts  are  made  from  the  first  45  data  points,  i.e.,  the 
data  from  1861  to  1909.  Forecasts  from  1910  to  1944  and 
from  1910  to  1984  are  shown  by  solid  circles  and  solid 
line,  and  by  open  circles  and  dashed  line,  respectively. 
The  embedding  dimension  has  been  set  to  4,  according 
to  the  results  of  fig.  (4). 
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Figure  6:  Plot  of  the  correlation  coefficient  as  a  function 
of  embedding  dimension  for  colored  noise  with  fractal 
dimension  of  1.603.  The  random  noise  is  not  synthetic 
but  a  leakage- current  fluctuations  observed  for  a  semi¬ 
conductor  device  (Miyanoet  ah,  1992).  Solid  circles  and 
solid  line:  Predictions  on  a  test  set  of  50  points  are  made 
from  a  training  set  of  50  examples.  Open  cicles  dashed 
line:  Predictions  on  a  test  set  of  500  points  and  a  training 
set  of  500  examples. 


Figure  7 :  Plot  of  the  correlation  coefficient  as  a  function 
of  prediction-time  step  for  the  random  noise  observed  in 
a  semiconductor  device.  The  embedding  dimension  is 
set  to  3.  Solid  circles  and  solid  line:  Predictions  on  a 
test  set  of  50  points  are  made  from  a  training  set  of  50 
examples.  Open  cicles  dashed  line:  Predictions  on  a  test 
set  of  500  points  and  a  training  set  of  500  examples. 


Figure  8:  Forecasts  of  the  meteorological  time  series  by 
regularization  network:  forecasts(upper  part);  and  resid¬ 
uals  obtained  by  subtracting  the  corresponding  target 
from  each  prediction  (lower  part).  The  network  has  been 
trained  on  the  first  45  data  points,  i.e.,  the  time  series 
from  1861  to  1909.  Solid  circles  and  solid  line  indicate 
predictions  on  the  input  vectors  that  the  network  has 
not  seen.  Open  circles  and  solid  line  indicate  results  of 
training.  The  observed  time  series  is  shown  by  dashed 
lines. 
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Figure  9:  Forecasts  of  tlie  meteorological  time  series 
by  backpro]iagation  network:  forecasts! upper  part);  aucl 
rcsirluals  obtained  by  subtracting  the  corresponding  tar¬ 
get  from  each  prediction  (lower  part).  The  network  has 
been  trained  on  the  first  45  data  points,  i.e.,  the  time 
series  from  1861  to  1909.  Solid  circles  and  solid  line  in¬ 
dicate  predictions  on  the  input  vectors  that  the  network 
has  not  seen.  Open  circles  and  solid  line  indicate  results 
of  training.  The  observed  time  series  is  shown  by  daslied 
lines. 
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Figure  10:  Forecasts  of  the  meteorological  time  series  by 
linear  autoregression:  forecasts! npiier  Pfrt);  and  residu¬ 
als  (lower  part).  The  model  has  been  trained  on  the  first 
45  data  points,  i.e.,  the  time  series  from  1861  to  1909. 
Solid  circles  and  solid  line  indicate  predictions  on  tin'  in¬ 
put  vectors  that  the  network  has  not  seen.  Open  circles 
and  solid  line  indicate  resrdts  of  fitting.  The  observed 
time  series  is  shown  by  dashed  lines. 
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Figure  11:  The  ®-noise  observed  for  a  semiconductor 
device  (Miyano,  1994;  Miyano  et  ah,  1992). 
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Figure  12:  Results  of  training  (from  0  to  249)and  pre¬ 
dictions  (from  250  to  500)  of  the  ®-noise  using  Gaus¬ 
sian  Hyper  Basis  Functions  without  linear  terms.  The 
embedding  dimension  is  set  to  3.  The  network  is  trained 
for  first  250  library  patterns,  i.e.,  the  time  series  from 
time  =  0  to  250.  The  predictions  agree  well  with  the 
targets  except  for  the  portion  from  time  =  300  to  350 
during  which  the  network  forecasts  lower  values  than  ob¬ 
served. 


Figure  13:  Residuals  of  the  approximation  of  /  ^  ®-noise 
by  a  Gaussian  Hyper  Basis  Functions. 
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